Utilizing corpus statistics for Hindi word sense disambiguation
Abstract
Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context. This paper compares three algorithms for Hindi WSD, all based on corpus statistics. The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn weights for Content Words (CWs). These weights are used in the disambiguation process to assign a score to each sense. We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CW in a fixed window size. The second algorithm disambiguates using the conditional probability of words and phrases co-occurring with each sense of an ambiguous word. The third algorithm is based on the classification information model. The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme. The WSD algorithm using word co-occurrence statistics results in an average precision of 68.73%. The WSD algorithm using the classification information model results in an average precision of 76.34%. All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
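The abstract does not give the exact formulas, so the following is only a minimal sketch of how a TF-IDF weighted, Lesk-style sense score could look. It assumes that each sense is represented by one bag of content words (its definition plus its sense-tagged training contexts), that TF is the raw count within that bag, and that IDF is computed across the senses of the ambiguous word. The function names `build_sense_weights` and `disambiguate` and the toy English data are hypothetical and stand in for the Hindi resources used in the paper.

```python
import math
from collections import Counter


def build_sense_weights(sense_docs):
    """Learn a TF-IDF weight for every content word of every sense.

    sense_docs maps a sense label to a list of content words (assumed here
    to be the sense definition plus sense-tagged training contexts).
    """
    n_senses = len(sense_docs)
    tf = {sense: Counter(words) for sense, words in sense_docs.items()}

    # Document frequency: in how many sense bags does each word occur?
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    # Weight = TF * IDF; a word shared by all senses gets weight 0.
    return {
        sense: {w: c * math.log(n_senses / df[w]) for w, c in counts.items()}
        for sense, counts in tf.items()
    }


def disambiguate(context_words, weights):
    """Return the sense whose weighted overlap with the context is highest."""
    context = set(context_words)
    scores = {
        sense: sum(word_weights.get(w, 0.0) for w in context)
        for sense, word_weights in weights.items()
    }
    return max(scores, key=scores.get)


# Toy data for the ambiguous word "bank" (illustrative only).
sense_docs = {
    "finance": ["money", "deposit", "account", "money", "institution", "land"],
    "river": ["river", "water", "shore", "land", "water"],
}
weights = build_sense_weights(sense_docs)

print(disambiguate(["he", "opened", "an", "account", "to", "deposit", "money"], weights))
print(disambiguate(["they", "sat", "near", "the", "water"], weights))
```

In this sketch, a direct overlap baseline would correspond to giving every matching word a weight of 1, which is why a word such as "land" that appears under both senses no longer influences the decision once IDF is applied.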
Similar resources
Word Sense Disambiguation in Bengali applied to Bengali-Hindi Machine Translation
We have developed a word sense disambiguation (WSD) system for the Bengali language and applied it to obtain the correct lexical choice in Bengali-Hindi machine translation. We are not aware of any existing system for Bengali WSD. Since there is no sense-annotated Bengali corpus and no sufficiently large parallel corpus for the Bengali-Hindi language pair, we had to use an unsupervised approach. We use a...
Word Sense Disambiguation in Hindi Language Using Hyperspace Analogue to Language and Fuzzy C-Means Clustering
The problem of Word Sense Disambiguation (WSD) can be defined as the task of assigning the most appropriate sense to a polysemous word within a given context. Many supervised, unsupervised and semi-supervised approaches have been devised to deal with this problem, particularly for the English language. However, this is not the case for the Hindi language, where not much work has been done. In th...
An Investigation to Semi supervised approach for HINDI Word sense disambiguation
This paper investigates the Yarowsky algorithm for Hindi word sense disambiguation. The evaluation was carried out on a manually created sense-tagged corpus consisting of Hindi words (nouns). The sense definitions were obtained from Hindi WordNet, which is developed at IIT Bombay. The maximum observed precision of 61.7 on 605 test instances corresponds to the case when both ste...
Cross-Lingual Word Sense Disambiguation
Word Sense Disambiguation using Cross-Lingual approach has been used successfully for languages like Farsi and Hindi. However, a comparable corpus in the form of Wikipedia articles available in English and Hindi has been used for such a task. This motivated us to further the approach and test the results when a parallel corpus is used. In this project, we specifically wanted to observe if the a...
Some Challenges of Automated Annotation in A Multilingual Scenario
A key ingredient of today’s NLP scenario is annotation and this paper discusses challenges involved in one of the toughest annotation tasks which is sense marking. A large amount of data needs to be sense marked accurately by human annotators in order to train the machine to understand the spoken languages. The sense marked corpus for various languages facilitate the task of Word Sense Disambig...
Journal title:
- Int. Arab J. Inf. Technol.
Volume 12
Publication date: 2015